The future of generative AI is using it less

The boring business problem at the heart of the generative AI bubble

In my experience, the most effective criticism comes from a place of practical understanding of the subject matter. It's why I own a little bit of Bitcoin and Ethereum and one NFT, and why I've been forcing myself to use generative AI more, even if it's not proving all that valuable for me in my day-to-day work. A few months ago, I spent a day generating eminent theorist Namfits Ezelbergasz, and lately I've been working on a collaborative narrative project that I'm actually pretty excited to tell you about.

While reading the latest research on generating narratives with LLMs, I found something pretty surprising: a majority of the studies' methods aren't really about the LLMs themselves. I expected them to describe statistical methods for training models or steering their outputs. Instead, most focus on a problem that will be familiar to both developers and writers: state management.

One of the reasons LLMs struggle to write good stories at length is that they can't efficiently handle very much new information at once. As a result, when generating stories they routinely produce wacky results, like bringing characters back from the dead or changing locations mid-scene.

Reading these papers is a somewhat surreal experience, because despite being wrapped in a lot of technical jargon, most of them describe researchers trying to figure out what English prompts to write, in what order, to get an LLM to generate the next part of a story according to some kind of narrative framework while keeping track of what came before it. A lot of the work concerns breaking down the storytelling process into smaller sessions, called "agents," that require less input to function, and then automatically extracting the changing "state" of a story – information about characters, locations, events, and so on – from these agents so it can be stored outside of the LLM and loaded back into it later. So, a lot of "summarize the last series of events in 140-character bullet points," "output a list of the character's items in this JSON format," and so on.
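To make that concrete, here's a rough sketch of the pattern those papers describe. The `call_llm` function is a hypothetical stand-in for whatever model API you'd actually be using; the JSON keys are my own invention, not any paper's schema.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder; swap in a real API call here.
    raise NotImplementedError

def extract_story_state(latest_scene: str) -> dict:
    # Compress the scene into structured "state" that lives outside the model
    # and can be re-injected into later prompts.
    prompt = (
        "Summarize the scene below as JSON with the keys 'characters', "
        "'locations', and 'open_plot_threads'. Keep each entry under 140 "
        "characters.\n\n" + latest_scene
    )
    return json.loads(call_llm(prompt))

def write_next_scene(story_state: dict, outline_beat: str) -> str:
    # Only the compact state and the next outline beat go into the prompt,
    # not the full text of everything written so far.
    prompt = (
        "Current story state:\n" + json.dumps(story_state, indent=2) +
        "\n\nWrite the next scene, covering this beat: " + outline_beat
    )
    return call_llm(prompt)
```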

The industry calls this bottleneck the "context window," which you can think of as the LLM's short-term memory. LLMs are pre-trained on giant corpuses of material in a way that enables them to call on that material and generate new responses relatively efficiently. But the new content a user inputs — a text prompt, a reference image, a codebase, etc. — has to be processed on the fly. The longer a session with an LLM continues, the bigger the context gets, because it needs to hold the history of the session's user prompts and all of the model's responses to generate the next one.
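If it helps, here's a toy illustration of that growth. The `count_tokens` helper is just a crude word count standing in for a real tokenizer, and the responses are canned:

```python
# Every turn re-sends the full history of prompts and responses, so the
# context only ever grows over the course of a session.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

history: list[str] = []

def send(prompt: str, canned_response: str) -> None:
    context = "\n".join(history + [prompt])
    turn = len(history) // 2 + 1
    print(f"turn {turn}: ~{count_tokens(context):,} tokens of context")
    history.extend([prompt, canned_response])

send("Write the opening scene.", "It was a dark and stormy night. " * 200)
send("Now introduce the detective.", "The detective lit a cigarette. " * 200)
send("Continue the chase through the docks.", "They ran past the cranes. " * 200)
```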

This processing of new information, which the industry calls "inference," is why querying an LLM is computationally expensive in the first place, and the longer the context is, the more expensive, and slower, inference becomes. In fact, the compute required for inference generally scales quadratically with the length of the combined input and output, meaning that if you double your context, you need roughly four times as much compute to return a result.
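A quick bit of back-of-the-envelope arithmetic shows why that matters (this is a simplification, since real systems use tricks like key-value caching, but the shape of the curve is the point):

```python
# If inference cost grows with the square of the combined input and output
# length, doubling the context roughly quadruples the work.
baseline = 8_000  # tokens
for tokens in (8_000, 16_000, 32_000, 64_000):
    relative_cost = (tokens / baseline) ** 2
    print(f"{tokens:>6} tokens -> ~{relative_cost:.0f}x the compute of {baseline:,} tokens")
```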

It's worth noting that with the latest models, you could theoretically write an entire novel within a single LLM session. Google's latest Gemini 2.5 models, for example, support context windows of up to 1 million tokens, which Google claims is enough to store eight English-language books. And Gemini gives users the ability to cache context at a cheaper rate.

There are just a few problems with this idea. For one, it's very expensive to feed large contexts in as input, and even more expensive to generate large amounts of text as output. Reddit is littered with stories of people (often developers working with large codebases) unexpectedly racking up hundreds of dollars of charges per day working with long contexts, and that's with relatively small outputs — maybe a couple of code files, not thousands of words of writing.

Another problem is that although eight novels' worth of context sounds like a lot, remember that every new query adds to the length of the previous one. So if you ask an LLM to add 5,000 words to a 100,000-word story that's already in its context, that new query "costs" another 100,000 words of input and 5,000 words of output, the next costs 105,000 words of input, and so on. Given the amount of re-prompting required to get good results from LLMs, it wouldn't actually take much writing in a session to exceed even a context limit that high.
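Here's that same arithmetic as a little script, using word counts as a rough stand-in for tokens:

```python
# Each round, the whole story so far goes back in as input, and only the new
# section comes back as output. Words stand in for tokens here; real billing
# is per token and varies by model.
story_words = 100_000
addition = 5_000
total_input = total_output = 0

for round_number in range(1, 6):
    total_input += story_words
    total_output += addition
    story_words += addition
    print(f"round {round_number}: input so far {total_input:,} words, "
          f"output so far {total_output:,} words")
```

Five rounds of revision on that story already costs over half a million words of input for just 25,000 words of new text.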

Most importantly, LLMs aren't actually very good at using long contexts, particularly when they're asked to keep track of lots of different bits of information like you need to do when writing a story. In fact, Google's own documentation page suggests that heavy caching is necessary to achieve good performance at a reasonable cost, something that's hard to do with a wall of narrative text that you need to be able to edit from beginning to end frequently.

In discussing these studies and limitations with a developer friend of mine, George King, what became clear to both of us is that these kinds of problems aren't particularly novel. Once you reach a certain scale, most software development is fundamentally about optimizing the movement of information through computationally expensive bottlenecks. This usually means trying to optimize and minimize your application's most expensive tasks, which in the case of AI applications, is querying the LLM. In fact, optimizing the process of shuffling information in and out of an LLM's context window is basically the business of most so-called "AI wrapper" tools, including popular developer tools like Cursor and Cline.

I've no doubt that long context performance, and pricing, will improve over time. But what this all points to is that LLMs are not immune from the fundamental principles of the software business. Because of the relative expense of querying an LLM, particularly a foundation model, companies will naturally start to invest in lower-cost tools that surround it, using caches, classic NLP tools, and even smaller, fine-tuned models to "gatekeep" calls to the big expensive LLM, as George put it.
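A minimal sketch of what that gatekeeping might look like, with toy stand-ins for the cheap and expensive models (nothing here is any particular vendor's API):

```python
# Answer from a cache or a cheap model when you can; only pay for the big
# foundation model when the query actually seems to need it.

cache: dict[str, str] = {}

def looks_hard(query: str) -> bool:
    # Toy heuristic; in practice this might be a classifier or a small
    # fine-tuned model deciding whether to escalate.
    return len(query.split()) > 30

def cheap_model_answer(query: str) -> str:
    return f"[small model] {query[:40]}"

def big_model_answer(query: str) -> str:
    return f"[expensive foundation model] {query[:40]}"

def answer(query: str) -> str:
    if query in cache:  # free: we've answered this before
        return cache[query]
    if looks_hard(query):
        result = big_model_answer(query)
    else:
        result = cheap_model_answer(query)
    cache[query] = result
    return result
```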

That businesses will seek to reduce costs wherever possible shouldn't surprise anyone, but I think it's actually a pretty big problem for the profitability theory of the generative AI industry as a whole. Almost all of the big AI companies are already losing billions of dollars a year on the generative AI portions of their businesses, in part because they're massively subsidizing the cost of inference to gain adoption. They're able to justify those losses for now on the assumption that generative AI is unlike anything that has come before it, capable of replacing entire professions with lower-cost alternatives.

If that's not true — if, indeed, it's subject to the same rules as every other kind of software — it seems more likely that instead of replacing entire professions, generative AI will simply become an expensive part of the software stack that companies will have to learn how to deploy efficiently, likely using wraparound optimizations built primarily by humans. In fact, we're already seeing hints that this may be the case. Even at today's drastically subsidized rates, businesses are increasingly reporting that they aren't seeing returns on their AI investments. Meanwhile, as this week's Cursor pricing kerfuffle showed, today's leading AI wrapper tools still aren't valuable enough to survive any kind of price hike or service degradation — and it seems unlikely this one would get them even close to profitability.

Generative AI is probably going to be valuable software. But for the moment, anyway, it's still software, and that's a problem for an entire industry predicated on it becoming something much more.

don't subscribe

Careful: if you subscribe, critical and independent thinking about media, technology, politics, and society (and the occasional non sequitur about bread) will be sent to your inbox.